Document Retrieval from Wikipedia Data Using TF-IDF

Fire up GraphLab Create


In [2]:
import graphlab

Load some text data from Wikipedia: pages on people


In [7]:
people = graphlab.SFrame('people_wiki.gl/')

Let's view the dataset with head to look at the first 5 rows.

The data contains: a link to the Wikipedia article, the name of the person, and the text of the article.


In [8]:
people.head(5)


Out[8]:
URI name text
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
<http://dbpedia.org/resou
rce/Harpdog_Brown> ...
Harpdog Brown harpdog brown is a singer
and harmonica player who ...
<http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
Franz Rottensteiner franz rottensteiner born
in waidmannsfeld lower ...
<http://dbpedia.org/resou
rce/G-Enka> ...
G-Enka henry krvits born 30
december 1974 in tallinn ...
[5 rows x 3 columns]


In [9]:
len(people)


Out[9]:
59071

Explore the dataset and check out the text it contains

Exploring the entry for President Obama: let's take the subset of the data for Obama


In [10]:
obama = people[people['name'] == 'Barack Obama']

The obama variable contains all the information about Barack Obama from Wikipedia.


In [11]:
obama


Out[11]:
URI name text
<http://dbpedia.org/resou
rce/Barack_Obama> ...
Barack Obama barack hussein obama ii
brk husen bm born august ...
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [14]:
obama['text']

Exploring the entry for actor George Clooney, stored in the variable clooney


In [15]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

Get the word counts for the Obama article and store the result in the same dataset as a new column named word_count.


In [16]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [17]:
print obama['word_count']


[{'operations': 1, 'represent': 1, 'office': 2, 'unemployment': 1, 'doddfrank': 1, 'over': 1, 'unconstitutional': 1, 'domestic': 2, 'major': 1, 'years': 1, 'against': 1, 'proposition': 1, 'seats': 1, 'graduate': 1, 'debate': 1, 'before': 1, 'death': 1, '20': 2, 'taxpayer': 1, 'representing': 1, 'obamacare': 1, 'barack': 1, 'to': 14, '4': 1, 'policy': 2, '8': 1, 'he': 7, '2011': 3, '2010': 2, '2013': 1, '2012': 1, 'bin': 1, 'then': 1, 'his': 11, 'march': 1, 'gains': 1, 'cuba': 1, 'school': 3, '1992': 1, 'new': 1, 'not': 1, 'during': 2, 'ending': 1, 'continued': 1, 'presidential': 2, 'states': 3, 'husen': 1, 'osama': 1, 'californias': 1, 'equality': 1, 'prize': 1, 'lost': 1, 'made': 1, 'inaugurated': 1, 'january': 3, 'university': 2, 'rights': 1, 'july': 1, 'gun': 1, 'stimulus': 1, 'rodham': 1, 'troop': 1, 'withdrawal': 1, 'brk': 1, 'nine': 1, 'where': 1, 'referred': 1, 'affordable': 1, 'attorney': 1, 'on': 2, 'often': 1, 'senate': 3, 'regained': 1, 'national': 2, 'creation': 1, 'related': 1, 'hawaii': 1, 'born': 2, 'second': 2, 'defense': 1, 'election': 3, 'close': 1, 'operation': 1, 'insurance': 1, 'sandy': 1, 'afghanistan': 2, 'initiatives': 1, 'for': 4, 'reform': 1, 'house': 2, 'review': 1, 'representatives': 2, 'ended': 1, 'current': 1, 'state': 1, 'won': 1, 'limit': 1, 'victory': 1, 'unsuccessfully': 1, 'reauthorization': 1, 'keynote': 1, 'full': 1, 'patient': 1, 'august': 1, 'degree': 1, '44th': 1, 'bm': 1, 'mitt': 1, 'attention': 1, 'delegates': 1, 'lgbt': 1, 'job': 1, 'harvard': 2, 'term': 3, 'served': 2, 'ask': 1, 'november': 2, 'debt': 1, 'by': 1, 'wall': 1, 'care': 1, 'received': 1, 'great': 1, 'signed': 3, 'libya': 1, 'receive': 1, 'of': 18, 'months': 1, 'urged': 1, 'foreign': 2, 'american': 3, 'protection': 2, 'economic': 1, 'act': 8, 'military': 4, 'hussein': 1, 'or': 1, 'first': 3, 'control': 4, 'named': 1, 'clinton': 1, 'dont': 2, 'campaign': 3, 'russia': 1, 'civil': 1, 'reinvestment': 1, 'into': 1, 'address': 1, 'primary': 2, 'community': 1, 
'mccain': 1, 'down': 1, 'hook': 1, '63': 1, 'americans': 1, 'elementary': 1, 'total': 1, 'earning': 1, 'repeal': 1, 'from': 3, 'raise': 1, 'district': 1, 'spending': 1, 'republican': 2, 'legislation': 1, 'three': 1, 'relations': 1, 'nobel': 1, 'start': 1, 'tell': 1, 'iraq': 4, 'convention': 1, 'resulted': 1, 'john': 1, 'was': 5, '2012obama': 1, 'form': 1, 'that': 1, 'tax': 1, 'sufficient': 1, 'republicans': 1, 'strike': 1, 'hillary': 1, 'street': 1, 'arms': 1, 'honolulu': 1, 'filed': 1, 'worked': 1, 'hold': 1, 'with': 3, 'obama': 9, 'ii': 1, 'has': 4, '1997': 1, '1996': 1, 'whether': 1, 'reelected': 1, 'budget': 1, 'us': 6, 'nations': 1, 'recession': 1, 'while': 1, 'taught': 1, 'marriage': 1, 'policies': 1, 'promoted': 1, 'called': 1, 'and': 21, 'supreme': 1, 'ordered': 3, 'nominee': 2, 'process': 1, '2000in': 1, 'is': 2, 'romney': 1, 'briefs': 1, 'defeated': 1, 'general': 1, '13th': 1, 'as': 6, 'at': 2, 'in': 30, 'sought': 1, 'organizer': 1, 'shooting': 1, 'increased': 1, 'normalize': 1, 'lengthy': 1, 'united': 3, 'court': 1, 'recovery': 1, 'laden': 1, 'laureateduring': 1, 'peace': 1, 'administration': 1, '1961': 1, 'illinois': 2, 'other': 1, 'which': 1, 'party': 3, 'primaries': 1, 'sworn': 1, 'relief': 2, 'war': 1, 'columbia': 1, 'combat': 1, 'after': 4, 'islamic': 1, 'running': 1, 'levels': 1, 'two': 1, 'involvement': 3, 'response': 3, 'included': 1, 'president': 4, 'law': 6, 'nomination': 1, '2008': 1, 'a': 7, '2009': 3, 'chicago': 2, 'constitutional': 1, 'defeating': 1, 'treaty': 1, 'federal': 1, '2007': 1, '2004': 3, 'african': 1, 'the': 40, 'democratic': 4, 'consumer': 1, 'began': 1, 'terms': 1}]
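Under the hood, count_words just maps each article to a bag-of-words dictionary. A minimal plain-Python sketch of the same idea (using a hypothetical snippet of text, not the library call above):

```python
# Plain-Python equivalent of graphlab.text_analytics.count_words:
# split the text on whitespace and count occurrences of each word.
from collections import Counter

text = "barack hussein obama ii born august 4 1961 is the 44th president"
word_count = Counter(text.split())
print(word_count['president'])  # 1
```

Counter behaves like a dict from word to frequency, which is exactly the shape of the word_count column shown above.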

Sort the word counts for the Obama article. First we use stack to turn the dictionary of word counts into a table with columns word and count.


In [18]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

Let's peek at the table, then sort the word counts to show the most common words at the top.


In [22]:
obama_word_count_table.head(5)


Out[22]:
word count
normalize 1
sought 1
combat 1
continued 1
unconstitutional 1
[5 rows x 2 columns]


In [23]:
obama_word_count_table.sort('count',ascending=False)


Out[23]:
word count
the 40
in 30
and 21
of 18
to 14
his 11
obama 9
act 8
a 7
he 7
[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
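The stack-then-sort pattern above can be sketched in plain Python (with a hypothetical toy sentence) as turning the count dictionary into (word, count) rows and sorting by count:

```python
# Equivalent of stack('word_count') + sort('count', ascending=False):
# dict items become rows, sorted descending by count.
from collections import Counter

counts = Counter("the quick fox jumps over the lazy dog the end".split())
table = sorted(counts.items(), key=lambda wc: wc[1], reverse=True)
print(table[0])  # ('the', 3)
```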

The most common words are uninformative ones like "the", "in", "and", etc. These words don't carry any meaningful information about the article, so we can remove them; such words are commonly called stop words.

Compute TF-IDF for the corpus

To give more weight to informative words, we weigh them by their TF-IDF scores. TF-IDF scores the importance of a word in a document based on how frequently it appears in that document relative to how frequently it appears across the whole corpus, so we compute it over the entire dataset.
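A common TF-IDF variant is tf(w, d) * log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. Here is a sketch on a tiny hypothetical corpus (the exact smoothing used by graphlab.text_analytics.tf_idf may differ):

```python
import math
from collections import Counter

# Toy corpus of three hypothetical documents, each as a word-count dict.
docs = [Counter(d.split()) for d in
        ["the cat sat", "the dog sat", "the cat ran home"]]
N = len(docs)

def tf_idf(doc, docs):
    # tf(w, d) * log(N / df(w)) -- one common TF-IDF formulation.
    scores = {}
    for word, tf in doc.items():
        df = sum(1 for d in docs if word in d)  # document frequency
        scores[word] = tf * math.log(N / df)
    return scores

scores = tf_idf(docs[0], docs)
# 'the' appears in every document, so log(3/3) = 0: uninformative.
print(scores['the'])  # 0.0
```

Words appearing in every document score zero, which is exactly why stop words like "the" stop dominating once TF-IDF is applied.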


In [28]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head(5)


Out[28]:
URI name text word_count
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
{'since': 1, 'carltons':
1, 'being': 1, '2005' ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
{'precise': 1, 'thomas':
1, 'closely': 1, ...
<http://dbpedia.org/resou
rce/Harpdog_Brown> ...
Harpdog Brown harpdog brown is a singer
and harmonica player who ...
{'just': 1, 'issued': 1,
'mainly': 1, 'nominat ...
<http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
Franz Rottensteiner franz rottensteiner born
in waidmannsfeld lower ...
{'all': 1,
'bauforschung': 1, ...
<http://dbpedia.org/resou
rce/G-Enka> ...
G-Enka henry krvits born 30
december 1974 in tallinn ...
{'legendary': 1,
'gangstergenka': 1, ...
tfidf
{'since':
1.455376717308041, ...
{'precise':
6.44320060695519, ...
{'just':
2.7007299687108643, ...
{'all':
1.6431112434912472, ...
{'legendary':
4.280856294365192, ...
[5 rows x 5 columns]


In [30]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
tfidf


Out[30]:
docs
{'since':
1.455376717308041, ...
{'precise':
6.44320060695519, ...
{'just':
2.7007299687108643, ...
{'all':
1.6431112434912472, ...
{'legendary':
4.280856294365192, ...
{'now': 1.96695239252401,
'currently': ...
{'exclusive':
10.455187230695827, ...
{'taxi':
6.0520214560945025, ...
{'houston':
3.935505942157149, ...
{'phenomenon':
5.750053426395245, ...
[59071 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Add the TF-IDF scores to the people dataset as a new column called tfidf.


In [31]:
people['tfidf'] = tfidf['docs']

Examine the TF-IDF for the Obama article using this new tfidf column.


In [32]:
obama = people[people['name'] == 'Barack Obama']

In [33]:
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)


Out[33]:
word tfidf
obama 43.2956530721
act 27.678222623
iraq 17.747378588
control 14.8870608452
law 14.7229357618
ordered 14.5333739509
military 13.1159327785
involvement 12.7843852412
response 12.7843852412
democratic 12.4106886973
[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Words with the highest TF-IDF are much more informative, so we sorted the words by their TF-IDF scores.

Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.


In [34]:
clinton = people[people['name'] == 'Bill Clinton']

In [35]:
beckham = people[people['name'] == 'David Beckham']

Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by

(1 - cosine_similarity),

to compute the distance between two documents. We find that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.
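For sparse TF-IDF dictionaries, cosine distance can be sketched in plain Python as follows (a minimal version of what graphlab.distances.cosine computes, using hypothetical toy vectors):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two sparse dict vectors.
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

x = {'president': 3.0, 'law': 1.0}
y = {'president': 2.0, 'football': 4.0}
print(round(cosine_distance(x, y), 4))  # 0.5757
```

A distance of 0 means the two vectors point in the same direction (identical word profiles), while 1 means they share no words at all.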


In [36]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])


Out[36]:
0.8339854936884276

In [38]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])


Out[38]:
0.9791305844747478

The smaller the distance between two documents, the more similar they are. For example, Obama and Clinton are more similar to each other than Obama and Beckham.

Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.


In [40]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')


PROGRESS: Starting brute force nearest neighbors model training.
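Brute-force nearest neighbors simply computes the distance from the query to every reference point and sorts. A sketch on a hypothetical mini-corpus of TF-IDF dicts:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two sparse dict vectors.
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical tiny corpus: name -> tf-idf dict (illustrative values only).
corpus = {
    'Barack Obama':  {'president': 3.0, 'law': 2.0},
    'Bill Clinton':  {'president': 2.5, 'law': 1.0},
    'David Beckham': {'football': 4.0, 'goal': 2.0},
}

def query(name, k=2):
    # Brute force: score every document against the query, sort ascending.
    q = corpus[name]
    ranked = sorted(corpus, key=lambda n: cosine_distance(q, corpus[n]))
    return ranked[:k]

print(query('Barack Obama'))  # the query point itself ranks first
```

This mirrors the model's behavior below: the query article always appears at rank 1 with distance 0.0.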

Applying the nearest-neighbors model for retrieval

Who is closest to Obama? To find out, we query the model.


In [41]:
knn_model.query(obama)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 136.686ms    |
PROGRESS: | 0            | 5602    | 9.4835      | 343.897ms    |
PROGRESS: | 0            | 11164   | 18.8993     | 564.768ms    |
PROGRESS: | 0            | 15380   | 26.0365     | 789.289ms    |
PROGRESS: | 0            | 20757   | 35.1391     | 1.01s        |
PROGRESS: | 0            | 25677   | 43.468      | 1.23s        |
PROGRESS: | 0            | 31202   | 52.8212     | 1.47s        |
PROGRESS: | 0            | 36100   | 61.1129     | 1.68s        |
PROGRESS: | 0            | 40930   | 69.2895     | 1.91s        |
PROGRESS: | 0            | 46050   | 77.957      | 2.13s        |
PROGRESS: | 0            | 50726   | 85.8729     | 2.36s        |
PROGRESS: | 0            | 55514   | 93.9784     | 2.58s        |
PROGRESS: | 0            | 58721   | 99.4075     | 2.81s        |
PROGRESS: | Done         |         | 100         | 2.87s        |
PROGRESS: +--------------+---------+-------------+--------------+
Out[41]:
query_label reference_label distance rank
0 Barack Obama 0.0 1
0 Joe Biden 0.794117647059 2
0 Joe Lieberman 0.794685990338 3
0 Kelly Ayotte 0.811989100817 4
0 Bill Clinton 0.813852813853 5
[5 rows x 4 columns]

As we can see, President Obama's article is closest to the one about his vice president, Joe Biden, and to those of other politicians.

Other examples of document retrieval


In [42]:
swift = people[people['name'] == 'Taylor Swift']

In [44]:
knn_model.query(swift)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 92.438ms     |
PROGRESS: | 0            | 4012    | 6.79183     | 318.015ms    |
PROGRESS: | 0            | 6950    | 11.7655     | 541.453ms    |
PROGRESS: | 0            | 10953   | 18.5421     | 765.986ms    |
PROGRESS: | 0            | 16173   | 27.3789     | 994.226ms    |
PROGRESS: | 0            | 20450   | 34.6194     | 1.21s        |
PROGRESS: | 0            | 25428   | 43.0465     | 1.45s        |
PROGRESS: | 0            | 30178   | 51.0877     | 1.66s        |
PROGRESS: | 0            | 34714   | 58.7666     | 1.88s        |
PROGRESS: | 0            | 38071   | 64.4496     | 2.11s        |
PROGRESS: | 0            | 42002   | 71.1043     | 2.33s        |
PROGRESS: | 0            | 46054   | 77.9638     | 2.56s        |
PROGRESS: | 0            | 50833   | 86.0541     | 2.78s        |
PROGRESS: | 0            | 54260   | 91.8556     | 3.01s        |
PROGRESS: | 0            | 57665   | 97.6198     | 3.25s        |
PROGRESS: | Done         |         | 100         | 3.33s        |
PROGRESS: +--------------+---------+-------------+--------------+
Out[44]:
query_label reference_label distance rank
0 Taylor Swift 0.0 1
0 Carrie Underwood 0.76231884058 2
0 Alicia Keys 0.764705882353 3
0 Jordin Sparks 0.769633507853 4
0 Leona Lewis 0.776119402985 5
[5 rows x 4 columns]


In [30]:
jolie = people[people['name'] == 'Angelina Jolie']

In [31]:
knn_model.query(jolie)


PROGRESS: Starting pairwise querying...
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 24.658ms     |
PROGRESS: | Done         |         | 100         | 149.909ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[31]:
query_label reference_label distance rank
0 Angelina Jolie 0.0 1
0 Brad Pitt 0.784023668639 2
0 Julianne Moore 0.795857988166 3
0 Billy Bob Thornton 0.803069053708 4
0 George Clooney 0.8046875 5
[5 rows x 4 columns]


In [45]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [46]:
knn_model.query(arnold)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 53.147ms     |
PROGRESS: | 0            | 6114    | 10.3503     | 295.705ms    |
PROGRESS: | 0            | 11182   | 18.9298     | 502.157ms    |
PROGRESS: | 0            | 16511   | 27.9511     | 729.308ms    |
PROGRESS: | 0            | 21738   | 36.7998     | 951.159ms    |
PROGRESS: | 0            | 27106   | 45.8872     | 1.19s        |
PROGRESS: | 0            | 32153   | 54.4311     | 1.40s        |
PROGRESS: | 0            | 37346   | 63.2222     | 1.62s        |
PROGRESS: | 0            | 42853   | 72.5449     | 1.84s        |
PROGRESS: | 0            | 47690   | 80.7334     | 2.07s        |
PROGRESS: | 0            | 53085   | 89.8664     | 2.29s        |
PROGRESS: | 0            | 57544   | 97.415      | 2.52s        |
PROGRESS: | Done         |         | 100         | 2.62s        |
PROGRESS: +--------------+---------+-------------+--------------+
Out[46]:
query_label reference_label distance rank
0 Arnold Schwarzenegger 0.0 1
0 Jesse Ventura 0.818918918919 2
0 John Kitzhaber 0.824615384615 3
0 Lincoln Chafee 0.833876221498 4
0 Anthony Foxx 0.833910034602 5
[5 rows x 4 columns]
